Maaslin 2 Analysis of GO Terms for COVIRT19

Lets install some R packages that we are gonna need to run this analysis

if(!requireNamespace("BiocManager", quietly = TRUE))

install.packages("BiocManager")

Now lets load our libraries and set out environment

A tibble: 47,233 x 2,020 # good so far now do a little regex and fix some stuff

Transform the raw table by type of count (euk, term, bac, arc)

Make individual tibbles for biological processes and molecular fxn

make individual tibbles for each type (bac, euk, term, arc, vir, etc)

subselect tibbles for only the counts and go terminology

convert them to dataframes for downstream import to phylsoeq

convert the dataframes into phyloseq formats

import your metadata

a little regex to fix the stupid filename

I want to take a moment and thank the curators for all their hard work in annotation the metadata........ =)

making physeq object

Filter out the top level depth 0 GO Terms "molecular function" and "biological process"

filter out the negative control and unknown samples

Lets also remove the Micalovich samples for now

Lets change the names of the Go Terms so we can understand the description as well as the tag

This code was causing mismatches with name and GO TAG and has since been resolved 19 NOV 2020

NEW and improved code

DESeq2 VST transformation

YAAAAAAAAAAAAAAAASSSSSSSSSSSSS THANK YOU LIMMMA

Dont worry about the limma batch effect correction step, I think I found a better way by including it in the multivariate model

IT FIXED THE BATCH EFFECT!

convert the normalized counts to a phyloseq object and transform into relative abundances

MaAsLIN2

dir.create("R_Maaslin2") # Create a new directory

setwd("/home/jovyan/work/Jochum_3/jupyter_lab/GO_term_analysis/R_Maaslin2/") # Change the current working directory

getwd() #check if directory has been successfully changed

ok so here are the parameters you want to manipulate:

min abundance= the min rel abund hits (1%) #filters out XXXX GO_terms \ min prevalence = Min samples required with min abundance for a feature not to be filtered (0.1=10%=14.1000 samples) \ max_significance = the maximinum p adjusted value to be significant \

This will filter out 13770 GO TERMS \

normalization = CLR transformation \

CORRECTION dont normalize here, just use the VST transformed counts

correction = the mutliple test correction method to be done (BH=Benjamini-Hochberg)

This is what is looks like if we feed the VST normalized counts into maaslin and get log scale outputs

`case<-Maaslin2( input_data = df_input_data2, input_metadata = df_input_metadata2, output="./vst_case_no_sick", min_abundance = 3, min_prevalence = 0.1, #minimimum presences 10.5 samples normalization = "NONE", transform = "NONE", analysis_method = "LM", max_significance = 0.001,

random_effects = c("sample_name","publication","collection_location","sequence_type"),

random_effects = c("sample_name","publication"), fixed_effects = c("case"), correction="BH", standardize = TRUE, cores = 48, plot_heatmap = TRUE, plot_scatter = TRUE, heatmap_first_n = 100, reference=c("case,COVID19"))`

Ok this isnt working, I cant figure out how to not get >500 hits with the vst physeq object, and it doend tmake a lot of sense that everything is wayyy up but w/e.

I think we could take these results and then go make boxplots of the compositionally transformed counts

Lets see what this looks like if we just let maaslin2 do the normalization (CLR) on compositionally transformed counts

wow that looks great, lets do it again for outcome UPDATE: I decided to do this later with a subset of only COVID19 positive samples

`outcome<-Maaslin2( input_data = df_input_data, input_metadata = df_input_metadata, output="./outcome", min_abundance = 0.01, min_prevalence = 0.1, normalization = "CLR", transform = "NONE", analysis_method = "LM", max_significance = 0.05,

random_effects = c("sample_name","publication","collection_location","sequence_type"),

random_effects = c("sample_name","publication"), fixed_effects = c("outcome"), correction="BH", standardize = TRUE, cores = 48, plot_heatmap = TRUE, plot_scatter = TRUE, heatmap_first_n = 100, reference=c("outcome,Deceased"))`

MaAsLin2 outcome analyisis (pruned_samples)

ok thats interesting it looks like it just through everything else into NA regardless of case... I guess lets trying it again with only covid cases

DMM modeling using the MaAslin2 derived terms

Ok, lets import the MaAsLin2 derived significant terms

DMM Preprocessing /filtering

Ok lets filter out the GO_tag mataches from our phyloseq object

Ok lets fix our names again

DMM modeling time

convert counts to a matrix

Fit the dmm model

Check the model fit with different number of mixture componenets using standard information criteria

make a heatmap visualization of the cluster

log 2 Heatmap

square root version

print out the theta values

save the datasheet that show which GO terms contributed to each dmm group

save a datasheet that identifies which sample belongs to which dmm group

I added these filtering commands to pull out the counts with less than 1% relabund or greater tahn 22% (ie:molecular function)

make the balloon plot

DATA VISUALIZATION TIME and outcome comparison time

ok lets do a binomial test to see if we can use this

YAAAASSSSSS CASE ~OUTCOME Chi-squared TEST P=0.0517 LESGOOOO

HEATMAP TIME

OK GREAT JOB

NEXT STEPS:

REMOVE THE MIKALOVICH SAMPLES (DONE)

ADD A COLUMN FOR OUTCOME ON THE HEATMAP (DONE)